Goto

Collaborating Authors

 semantic relationship


OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

Bommarito, Michael J. II

arXiv.org Artificial Intelligence

We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.





Supplementary Cross Modal Retrieval For Function Level Binary Source Code Matching

Neural Information Processing Systems

We show the preprocessing, training and testing time of the models in Table 1. BinPro and B2SFinder also need to use traditional matching algorithms to compute the similarity scores. In comparison with the pre-training models, the end-to-end models get rid of the pre-training time. Compared with random sampling, our norm weighted sampling method requires more time. However, the additional time consumption is worth, because the effect has improved a lot.


TurkEmbed: Turkish Embedding Model on NLI & STS Tasks

Ezerceli, Özay, Gümüşçekiçci, Gizem, Erkoç, Tuğba, Özenç, Berke

arXiv.org Artificial Intelligence

This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4\% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications.


Mitigating Semantic Collapse in Partially Relevant Video Retrieval

Moon, WonJun, Jung, MinSeok, Park, Gilhan, Kim, Tae-Young, Cho, Cheol-Ho, Jun, Woojin, Heo, Jae-Pil

arXiv.org Artificial Intelligence

Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.


Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers

Sanghvi, Samyak S.

arXiv.org Artificial Intelligence

Antonym vs synonym distinction across multiple languages presents unique computational challenges due to the paradoxical nature of antonymous relationships words that share semantic domains while expressing opposite meanings. This work introduces Bhav-Net, a novel dual-space architecture that enables effective knowledge transfer from complex multilingual models to simpler, language-specific architectures while maintaining robust cross-lingual antonym--synonym distinction capabilities. Our approach combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space while antonymous pairs exhibit high similarity in a complementary space. Through comprehensive evaluation across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic relationship modeling transfers effectively across languages. The dual-encoder design achieves competitive performance against state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.


PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition

Rengarajan, Nanda Kumar, Yan, Jun, Wang, Chun

arXiv.org Artificial Intelligence

Abstract--Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) introducing a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power . Index T erms--Named Entity Recognition (NER), Few-Shot Learning, Large Language Models (LLMs), Instruction T uning, Data Augmentation. Named Entity Recognition (NER) is a foundational task in Natural Language Processing (NLP), enabling applications like information extraction, question answering, and event detection [1]. Traditional NER systems rely on supervised learning, requiring extensive annotated data for specific domains and predefined entity types. This dependency on large, labelled datasets limits their adaptability to new domains and entity categories.


Enhancing Self-Supervised Learning with Semantic Pairs A New Dataset and Empirical Study

Alkhalefi, Mohammad, Leontidis, Georgios, Zhong, Mingjun

arXiv.org Artificial Intelligence

Instance discrimination is a self-supervised representation learning paradigm wherein individual instances within a dataset are treated as distinct classes. This is typically achieved by generating two disparate views of each instance by applying stochastic transformations, which encourages the model to learn representations that are invariant to the common underlying object across these views. While this approach facilitates the acquisition of invariant representations for dataset instances under various handcrafted transformations (e.g., random cropping, color jittering), an exclusive reliance on such data transformations for achieving invariance may inherently limit the model's generalization to unseen datasets and diverse downstream tasks. The inherent limitation stems from the fact that the finite set of transformations within the data processing pipeline is unable to encompass the full spectrum of potential data variations. In this study, we provide the technical foundation for leveraging semantic pairs to enhance the generalization of the model's representation and empirically demonstrate that incorporating semantic pairs mitigates the issue of limited transformation coverage. Specifically, we propose that exposing the model to semantic pairs (i.e., two instances belonging to the same semantic category) introduces varied real-world scene contexts, thereby fostering the development of more generalizable object representations. To validate this hypothesis, we constructed and released a novel dataset comprising curated semantic pairs and conducted extensive experimentation to empirically establish that their inclusion enables the model to learn more general representations, ultimately leading to improved performance across diverse downstream tasks.